A novel algorithm for identifying low-complexity regions in a protein sequence
Abstract
MOTIVATION: We consider the problem of identifying low-complexity regions (LCRs) in a protein sequence. LCRs are regions of biased composition, normally consisting of different kinds of repeats.
RESULTS: We define new complexity measures that compute the complexity of a sequence from a given scoring matrix, such as BLOSUM62. These measures also take into account the order of amino acids in the sequence and the sequence length. We develop a novel graph-based algorithm, GBA, to identify LCRs in a protein sequence. In the graph constructed for the sequence, each vertex corresponds to a pair of similar amino acids, and each edge connects two such pairs that can be grouped together to form a longer repeat. GBA traverses this graph to find short subsequences as LCR candidates. It then extends them into longer subsequences that may contain full repeats with low complexity. Finally, the extended subsequences are post-processed to refine the repeats into LCRs. Our experiments on real data show that GBA has significantly higher recall than existing algorithms, including 0j.py, CARD, and SEG.
AVAILABILITY: The program is available on request.
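The graph construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' GBA implementation: the similarity groups (a stand-in for a scoring matrix such as BLOSUM62), the `max_dist` window, the `min_len` threshold, and all function names are assumptions made for illustration. Vertices are position pairs (i, j) holding similar amino acids, an edge links (i, j) to (i+1, j+1), and each maximal chain of edges is reported as a candidate repeat span. The extension and post-processing steps are not covered.

```python
# Hypothetical similarity groups; a real implementation would threshold
# scores from a substitution matrix such as BLOSUM62 instead.
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "NQ", "ST", "AG"]


def similar(a, b):
    """Return True if two residues are identical or share a group."""
    if a == b:
        return True
    return any(a in g and b in g for g in SIMILAR_GROUPS)


def build_graph(seq, max_dist=10):
    """Vertices: pairs (i, j), i < j <= i + max_dist, with similar residues.
    Edges: (i, j) -> (i + 1, j + 1) when both pairs are vertices."""
    n = len(seq)
    vertices = {
        (i, j)
        for i in range(n)
        for j in range(i + 1, min(n, i + max_dist + 1))
        if similar(seq[i], seq[j])
    }
    edges = {
        (i, j): (i + 1, j + 1)
        for (i, j) in vertices
        if (i + 1, j + 1) in vertices
    }
    return vertices, edges


def candidate_repeats(seq, max_dist=10, min_len=3):
    """Follow each maximal chain of edges and return the half-open spans
    [start, end) that cover both copies of the matched run."""
    vertices, edges = build_graph(seq, max_dist)
    has_predecessor = set(edges.values())
    spans = set()
    for start in vertices - has_predecessor:
        length, v = 1, start
        while v in edges:
            v = edges[v]
            length += 1
        if length >= min_len:
            i, j = start
            spans.add((i, j + length))
    return sorted(spans)


if __name__ == "__main__":
    print(candidate_repeats("ACDPGSPGSPGS"))
```

In this sketch a chain of length k starting at (i, j) means residues i..i+k-1 are pairwise similar to residues j..j+k-1, so the span [i, j+k) covers both copies of the run. Chains shorter than `min_len` are discarded as noise.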
Journal: Bioinformatics
Volume 22, Issue 24
Pages: -
Publication year: 2006